Vaccines have had critics since their earliest development. Recently, however, in the United States and around the world, vaccination has faced increasing skepticism and vaccination rates have begun to decline. Anti-vaccination rhetoric is especially prevalent on social media, where platforms serve as an open space for misinformation to spread.
The goal of this study is to download Twitter data and identify tweets as pro- or anti-vaccination. Tweets often have an associated location, which would allow researchers to locate communities where anti-vaccination sentiment is growing. This could help healthcare professionals identify communities at higher risk of infectious diseases.
In this document, we will provide start-to-finish methods on how to identify Twitter users and tweets that are “anti-vaccination.”
This case study consists of data exploration and visualization using Natural Language Processing techniques, a social network analysis with the igraph package, and a sentiment analysis of the results.
The libraries used in this study are listed in the following table, along with their purpose in this particular case study:
| Library | Purpose |
|---|---|
| stringr | Parsing text with regular expressions |
| tidytext | Loading important datasets and manipulating text data |
| dplyr | Dataframe manipulation |
| ggplot2 | Plotting sentiment |
| wordcloud | Creating wordcloud visuals |
| SnowballC | Word stemming |
| igraph | Performing social network analysis |
| threejs | Graphing large-scale social networks (compatible with igraph) |
| visNetwork | Graphing social networks with labels |
In order to run this code, please ensure you have these packages installed.
If you would like to download your own Twitter data, you will need to sign up for a Twitter developer account, create an application, and enter your own keys & credentials in the “What is the data?” section below.
The learning objectives include data cleaning and manipulation with natural language processing and regular expressions, social network analysis, semi-supervised learning concepts, and sentiment analysis.
To collect data from Twitter, we suggest using the rtweet package. Again, if you would like to download your own Twitter data, you will need to sign up for a Twitter developer account, create an application, and enter your own keys below. We’ve set the code chunk option eval = FALSE so that when the entire file is knit, the code in the chunk is ignored.
library(rtweet)
twitter_token <- create_token(
app = "twitter_app_name",
consumer_key = "XXXXXXXXXXXXXXX",
consumer_secret = "XXXXXXXXXXXXXXXXXXXXX",
access_token = "XXXXXXXXXXXXXXXXXXXXXXXXXXX",
access_secret = "XXXXXXXXXXXXXXXXX")
Once the twitter_token is created, it is stored in the workspace and you are free to use all the functions in rtweet!
The search_tweets function can download 18,000 tweets every 15 minutes (with the free developer account). However, adding retryonratelimit = TRUE automatically waits for the rate limit to reset (after 15 minutes) and then searches for the remaining tweets. This repeats until all requested tweets are downloaded.
We search for the 18,000 most recent tweets (type = "recent") that contain the hashtags “antivax” or “vaccineswork”, as well as just the word “vaccines”. include_rts = TRUE means we are including retweets. This is important for this case study: we want to be able to connect users to one another if they share the same opinion. We can assume if a user retweets a tweet without adding any text to it (which might state their opposition to the tweet) then they agree with the tweet.
We set lang = "en" to capture only tweets in English.
st <- search_tweets('#antivax OR vaccines OR #vaccineswork',
n = 18000,
type = "recent",
include_rts = TRUE,
lang = "en")
After collecting the tweets with the rtweet package, we exported the data as a csv file. This is an important step because the data you collect with the API can be very different every time. It also takes a few minutes to download new tweets.
Before exporting as a CSV we removed the rows that
Next, we’ll import a set of data that was created with the exact steps described above.
Before doing anything, we should load the libraries needed for the case study.
library(kableExtra)
library(stringr)
library(lexicon)
library(tidytext)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tm)
library(topicmodels)
library(wordcloud)
library(SnowballC)
library(igraph)
library(threejs)
library(visNetwork)
Now we upload the data as a dataframe with read.csv and look at all the fields/features/columns by printing out the colnames. Each row of the dataframe contains information about one tweet. We also print out the first row with head(tweets_df, 1) to get an idea of what kind of information is in each row and column.
tweets_df <- read.csv("data/date02_13_n5k_key3_uuidgens_final.csv",
stringsAsFactors = FALSE,
encoding="UTF-8")
colnames(tweets_df)
## [1] "status_id" "created_at"
## [3] "text" "source"
## [5] "display_text_width" "reply_to_status_id"
## [7] "is_quote" "is_retweet"
## [9] "favorite_count" "retweet_count"
## [11] "ext_media_type" "lang"
## [13] "quoted_status_id" "quoted_text"
## [15] "quoted_created_at" "quoted_source"
## [17] "quoted_favorite_count" "quoted_retweet_count"
## [19] "quoted_followers_count" "quoted_friends_count"
## [21] "quoted_statuses_count" "quoted_location"
## [23] "quoted_description" "quoted_verified"
## [25] "retweet_status_id" "retweet_text"
## [27] "retweet_created_at" "retweet_source"
## [29] "retweet_favorite_count" "retweet_retweet_count"
## [31] "retweet_followers_count" "retweet_friends_count"
## [33] "retweet_statuses_count" "retweet_location"
## [35] "retweet_description" "retweet_verified"
## [37] "place_url" "place_name"
## [39] "place_full_name" "place_type"
## [41] "country" "country_code"
## [43] "location" "description"
## [45] "url" "protected"
## [47] "followers_count" "friends_count"
## [49] "listed_count" "statuses_count"
## [51] "favourites_count" "account_created_at"
## [53] "verified" "account_lang"
## [55] "lat" "lng"
## [57] "uuidgens_user" "uuidgens_retweet"
## [59] "uuidgens_quote"
## status_id created_at
## 1 1.095809e+18 2019-02-13 22:15:38
## text
## 1 @NicArt24 @Jordan_Sather_ No additional abortions need to be performed to continue creating the vaccine.
## source display_text_width reply_to_status_id is_quote
## 1 Twitter Web Client 78 1.095808e+18 FALSE
## is_retweet favorite_count retweet_count ext_media_type lang
## 1 FALSE 0 0 NA en
## quoted_status_id quoted_text quoted_created_at quoted_source
## 1 NA <NA> <NA> <NA>
## quoted_favorite_count quoted_retweet_count quoted_followers_count
## 1 NA NA NA
## quoted_friends_count quoted_statuses_count quoted_location
## 1 NA NA <NA>
## quoted_description quoted_verified retweet_status_id retweet_text
## 1 <NA> NA NA <NA>
## retweet_created_at retweet_source retweet_favorite_count
## 1 <NA> <NA> NA
## retweet_retweet_count retweet_followers_count retweet_friends_count
## 1 NA NA NA
## retweet_statuses_count retweet_location retweet_description
## 1 NA <NA> <NA>
## retweet_verified place_url place_name place_full_name place_type country
## 1 NA <NA> <NA> <NA> <NA> <NA>
## country_code location
## 1 <NA> High in the Rocky Mountains
## description
## 1 A simple country Democrat, who loves God and USA, and dreams of a world without violence, bigotry or poverty.
## url protected followers_count friends_count
## 1 https://t.co/3z9yfjCDh8 FALSE 4544 5001
## listed_count statuses_count favourites_count account_created_at
## 1 2 7096 10536 2016-04-14 15:07:53
## verified account_lang lat lng uuidgens_user
## 1 FALSE en NA NA a9c0a85d-f02c-4185-bbc8-ba1aab670e33
## uuidgens_retweet uuidgens_quote
## 1 <NA> <NA>
In a separate file, unique IDs were generated for each user so that all personal identifiers could be removed, avoiding any assumptions about (or exploitation of) a user’s political or religious views or the state of their health.
We describe a few of the important columns in the following table:
For each row \(i\),
| Column | Description |
|---|---|
| text | Raw text from tweet \(i\) |
| uuidgens_user | Unique ID of the user who tweeted (or retweeted) tweet \(i\) |
| uuidgens_retweet | Unique ID of the user that tweet \(i\) was retweeted from |
| created_at | Date and time tweet \(i\) was tweeted |
| display_text_width | Number of characters in tweet \(i\) |
| location | Location the tweeting user has listed in their Twitter bio |
| description | Description the tweeting user has listed in their Twitter bio |
| lng and lat | Longitude and latitude coordinates of tweet \(i\) |
Using regular expressions to parse and clean text is an important skill when dealing with text data. Let’s take a look at the data to see what we need to clean by printing the head, or first 6 lines, of the text column - the raw text from the tweets we downloaded.
## [1] "@NicArt24 @Jordan_Sather_ No additional abortions need to be performed to continue creating the vaccine."
## [2] "@Jordan_Sather_ The cells were obtained more than 50 years ago, as a result of elective abortions — and today the cells are more than three generations removed from their origin. https://t.co/6T8qxdKkW3"
## [3] "Sad but true. Thanks anti-vaxers! #antivax #VaccinesWork #VaccinateYourKids #science https://t.co/h2N8Y29Im3"
## [4] "Gizmo, a 10-year-old male Rex cat, came in despite the weather for his annual exam and vaccines. Great job Gizmo! Good health takes priority. #rexcat #bothellpethosp #catsofinstagram #catsoftwitter #meow #vetmed #vettechs #animalspnw https://t.co/jJpkAqBu4z"
## [5] "@DarlaShine My children’s daycare director received the polio vaccine in the 70’s. Ended up getting it as a result. She has no feeling or movement in her legs and has to use crutches. Yet.....she still requires every child to be fully vaccinated. Bc she’s smart and knows death is worse. \U0001f64c\U0001f3fb"
## [6] "@markgongloff @DarlaShine My children’s daycare director received the polio vaccine in the 70’s. Ended up getting it as a result. She has no feeling or movement in her legs and has to use crutches. Yet.....she still requires every child to be fully vaccinated. Bc she’s smart and knows death is worse. \U0001f64c\U0001f3fb"
## [7] "Deliberate disinformation is behind a massive measles outbreak in Washington https://t.co/cJYbbSjYtK https://t.co/zUSYZuAO07"
## [8] "@FO7935 @Spacebunny21 @Viol3t Why are you even here except to be a advocate for chemical cocktails called vaccines. If you have pumped your children full of them (actually child abuse), then wtf difference does it make if I choose to not to? If they work, your kids are \"safe\"."
## [9] "The CDC declared that #Measles had been eliminated in the U.S. 19 years ago – but right now, there is an outbreak in 10 states. \n@NBCNews Medical Contributor @DrNatalieTV joins @SRuhle and @AliVelshi to discuss everything you need to know.\n \nhttps://t.co/MRxb4bkoHy"
## [10] "Sharyl Attkisson and measles vaccine math – wrong in so many ways https://t.co/vSic7TtNXX Via @skepticalraptor"
From this small peek at the data, we can already see that there are a lot of strings that are not real words. We should remove text that does not help and, in fact, may hinder our analysis. For example, we observe lots of links (“https…”) and UTF characters (“\U000…”) that aren’t giving us any insightful information.
To do this, first we create a string named pat (short for pattern) that contains regular expression or “regex” patterns that match with various unwanted characters or phrases.
pat <- "[\r\n]|&.|@.*?[ \t\r\n]|@.*?$|https:.*[ \t\r\n]|https:.*$|[^[:alpha:][ \t\r\n]?&/\\-#.']"
Note that the “|” character separates each pattern in the string. For example, the first pattern [\r\n] matches carriage return and newline characters, and is followed by a “|” to signify the end of that regular expression.
Let’s break down each of the strings in pat and explain what text it is meant to match with
| Regex Phrase | Text match |
|---|---|
| [\r\n] | Carriage returns and newline characters |
| &. | “&” followed by any single character (e.g. HTML entity remnants) |
| @.*?[ \t\r\n] | Tagged user handle followed by white space |
| @.*?$ | Tagged user handle at the end of the tweet |
| https:.*[ \t\r\n] | Embedded link followed by white space |
| https:.*$ | Embedded link at the end of the tweet |
| [^[:alpha:][ \t\r\n]?&/\\-#.'] | Any character that is not a letter, white space, or one of ?&/\-#.' (removes digits and UTF-8 symbols) |
Let’s break down one of the above patterns, @.*?[ \t\r\n], to see how regular expressions work.
First, @ simply matches the @ sign, the beginning of tagging a user on Twitter. In regular expressions, “.” matches any character except the newline \n. The “*” character matches 0 or more of the previous element, which in this case is “.”, i.e. any character. So far we have matched anything beginning with @. The “?” quantifier is non-greedy, meaning it matches as few characters as possible; we include it so that we remove the minimum amount of text, and thus only the username tag. We stop removing text when we reach a space, tab, carriage return or newline character, [ \t\r\n]. Thus, @.*?[ \t\r\n] matches any tagged or retweeted username followed by white space.
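To see this in action, here is a minimal sketch on a made-up example tweet (not from the dataset). Note that base R's gsub needs perl = TRUE to support the non-greedy *? quantifier; stringr's ICU engine supports it by default.

```r
# Made-up example tweet for illustration only
tweet <- "@SomeUser Vaccines save lives! https://t.co/abc123"

# Remove a tagged handle followed by white space (non-greedy match)
no_handle <- gsub("@.*?[ \t\r\n]", "", tweet, perl = TRUE)

# Remove a link at the end of the tweet
cleaned <- gsub("https:.*$", "", no_handle)
cleaned
# [1] "Vaccines save lives! "
```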
Now we remove the strings that match these patterns with str_replace_all. It checks for matches with every separate pattern in pat and looks for repeats of any pattern (hence the _all suffix). Each match is replaced with an empty string, "", and thus the unwanted string is removed.
We replace the tweets_df columns text and retweet_text with the cleaned versions of the columns by using mutate. Then we print the head of the text column to view our changes.
tweets_df <- tweets_df %>%
mutate(text = str_replace_all(text, pattern = pat, "")) %>%
mutate(retweet_text = str_replace_all(retweet_text, pattern = pat, "")) %>%
mutate(text = str_replace_all(text, pattern = "  ", " ")) %>%
mutate(retweet_text = str_replace_all(retweet_text, pattern = "  ", " "))
head(unique(tweets_df$text))
## [1] "No additional abortions need to be performed to continue creating the vaccine."
## [2] "The cells were obtained more than years ago as a result of elective abortions and today the cells are more than three generations removed from their origin. "
## [3] "Sad but true. Thanks anti-vaxers #antivax #VaccinesWork #VaccinateYourKids #science "
## [4] "Gizmo a -year-old male Rex cat came in despite the weather for his annual exam and vaccines. Great job Gizmo Good health takes priority. #rexcat #bothellpethosp #catsofinstagram #catsoftwitter #meow #vetmed #vettechs #animalspnw "
## [5] "My childrens daycare director received the polio vaccine in the s. Ended up getting it as a result. She has no feeling or movement in her legs and has to use crutches. Yet.....she still requires every child to be fully vaccinated. Bc shes smart and knows death is worse. "
## [6] "Deliberate disinformation is behind a massive measles outbreak in Washington "
Now that the raw tweet text is clean, let’s start manipulating the tweets with natural language processing techniques to gain insight about our data.
The tidy text format uses one token per row in our dataframe. In short, tokens in natural language processing are words; more technically, a token is a sequence of characters between two spaces. Tokenization is the process of breaking up a string, paragraph, or document into tokens.
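As a toy illustration of the idea (unnest_tokens also lowercases and strips punctuation for us), base R's strsplit can break a made-up sentence on white space:

```r
# Made-up example sentence
sentence <- "Vaccines save lives"

# Lowercase, then split on runs of white space: one token per element
tokens <- unlist(strsplit(tolower(sentence), "\\s+"))
tokens
# [1] "vaccines" "save"     "lives"
```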
We want to tokenize our tweets without losing any information. Specifically, we want to keep track of which tweet each word came from, since our analysis involves differentiating tweets. Other analyses, such as topic summarization, may not require this; you can just create a jumble (i.e. a “corpus”) of words and find commonalities.
In the code chunk below, we first create the dataframe text_df with the tweets_df column text and add the int column as an index for the tweets. Then, we tokenize the text data with tidytext’s unnest_tokens by using the text column as the input and word as the name of the output column. We replace text_df with this newly formatted dataframe.
unique_indices <- order(tweets_df$text)[!duplicated(sort(tweets_df$text))]
text_df <- data.frame("text" = tweets_df$text[unique_indices])
text_df$int <- as.numeric(unique_indices)
text_df$text <- as.character(text_df$text)
text_df <- text_df[order(text_df$int),]
row.names(text_df) <- 1:nrow(text_df)
head(text_df)
## text
## 1 No additional abortions need to be performed to continue creating the vaccine.
## 2 The cells were obtained more than years ago as a result of elective abortions and today the cells are more than three generations removed from their origin.
## 3 Sad but true. Thanks anti-vaxers #antivax #VaccinesWork #VaccinateYourKids #science
## 4 Gizmo a -year-old male Rex cat came in despite the weather for his annual exam and vaccines. Great job Gizmo Good health takes priority. #rexcat #bothellpethosp #catsofinstagram #catsoftwitter #meow #vetmed #vettechs #animalspnw
## 5 My childrens daycare director received the polio vaccine in the s. Ended up getting it as a result. She has no feeling or movement in her legs and has to use crutches. Yet.....she still requires every child to be fully vaccinated. Bc shes smart and knows death is worse.
## 6 Deliberate disinformation is behind a massive measles outbreak in Washington
## int
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 7
## int word
## 1 1 no
## 1.1 1 additional
## 1.2 1 abortions
## 1.3 1 need
## 1.4 1 to
## 1.5 1 be
Now that we have our data in a “tidy” format, we want to remove stop words from the text. Stop words in NLP are common words that don’t add value to a sentence, paragraph, document etc. For example, “still think the flu shot is…” doesn’t give us any extra information than “still think flu shot,” so we can remove these words to make our dataset more concise.
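The filtering itself is simple; here is a base-R sketch using the example sentence above with a tiny hand-made stop word vector (tidytext's anti_join does the same filtering, joining on the word column against the full stop_words dataset):

```r
tokens <- c("still", "think", "the", "flu", "shot", "is")

# Tiny hand-made stop word list, for illustration only
stop_words_mini <- c("the", "is", "a", "an")

# Keep only the tokens that are not stop words
kept <- tokens[!tokens %in% stop_words_mini]
kept
# [1] "still" "think" "flu"   "shot"
```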
To do this, simply load the stop_words dataset, which has two columns: word and lexicon. There are three different lexicons, referencing the source of the stop words: onix, SMART, and snowball.
The lexicons differ in which words they include. Some include more words than others, so choosing a larger lexicon removes more words from our data. To quickly see these differences, we can count the number of stop words in each lexicon using dplyr::group_by.
## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## # A tibble: 3 x 2
## lexicon n
## <chr> <int>
## 1 onix 404
## 2 SMART 571
## 3 snowball 174
Since we have no prior reason to choose a specific lexicon, let’s remove all of the stop words in the stop_words dataset from our data using the anti_join function. We can always modify this later on.
## Joining, by = "word"
## int word
## 1 1 additional
## 2 1 abortions
## 3 1 need
## 4 1 performed
## 5 1 continue
## 6 1 creating
## 7 1 vaccine
## 8 2 cells
## 9 2 obtained
## 10 2 years
Another technique often used alongside removing stop words is word stemming. Stemming reduces each word to a root so that different conjugations of the same word are not treated as completely different words. A human can tell that “vaccine”, “vaccinate” and “vaccination” have nearly the same meaning. However, if we were to count every repetition of “vaccine” in R we would not get an accurate total, since “vaccination” would be ignored. Instead, we change each of those words to “vaccin” with a stemming algorithm.
We will use the library SnowballC which implements the Porter stemming algorithm.
## int word
## 1 1 addit
## 2 1 abort
## 3 1 need
## 4 1 perform
## 5 1 continu
## 6 1 creat
## 7 1 vaccin
## 8 2 cell
## 9 2 obtain
## 10 2 year
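SnowballC::wordStem applies the full Porter algorithm (ordered suffix-rewrite rules with conditions on the remaining stem). As a crude illustration of the idea only, and emphatically not the Porter algorithm, a single suffix-stripping rule already maps our example words to a common root:

```r
words <- c("vaccines", "vaccinated", "vaccination")

# Crude single-rule suffix stripper, for illustration only
crude_stem <- function(w) sub("(ation|ated|es|s)$", "", w)

sapply(words, crude_stem, USE.NAMES = FALSE)
# [1] "vaccin" "vaccin" "vaccin"
```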
Since we’ve filtered out the most common English words, let’s also remove the most uncommon words in our dataset. This helps reduce the size of our data and removes words that carry little meaning.
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 "" 29
## 2 a.m 2
## 3 a.schiff 1
## 4 aaa 1
## 5 aaasmtg 1
## 6 aah 1
## # A tibble: 2,302 x 1
## word
## <chr>
## 1 a.schiff
## 2 aaa
## 3 aaasmtg
## 4 aah
## 5 aapi
## 6 abound
## 7 abovemajest
## 8 abstract
## 9 absurd
## 10 abt
## # ... with 2,292 more rows
## Joining, by = "word"
## [1] 23216
## [1] 20914
For a fun visual, we can plot our tweets on a map. First let’s find which tweets use the word or hashtag “antivax” and which tweets say “vaccines work.” Next we add the column Hashtags, which labels whether the tweet contains one of these phrases or neither.
anti_lst <- as.numeric(
grepl("antivax", tweets_df$text))
pro_lst <- as.numeric(
grepl("vaccines work", tweets_df$text))
head(anti_lst)
## [1] 0 0 1 0 0 0
tweets_df$Hashtags <- "none"
tweets_df$Hashtags[anti_lst == 1] <- "#antivax"
tweets_df$Hashtags[pro_lst == 1] <- "vaccines work"
Next we use ggplot2 to plot this on a map.
states <- map_data("state")
ggplot(data = states) +
geom_point(data = tweets_df, aes(x = lng, y = lat, colour = Hashtags), size = 1) +
geom_polygon(aes(x = long, y = lat, group=group), fill = NA, color = "dark grey") +
theme_void()
## Warning: Removed 4933 rows containing missing values (geom_point).
As we can see, there is a lot of missing data. This is expected, since most people probably don’t have location sharing turned on on Twitter. Below we calculate exactly what percentage of tweets have an associated latitude and longitude.
## [1] 1.34
A common NLP data exploration practice is looking at n-gram frequencies. N-grams are sequences of n items/tokens. Common n-grams that are used in NLP are unigrams, bigrams, and trigrams, meaning one word, two word, and three word combinations.
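For intuition, bigrams can be built by pairing each token with its successor; unnest_tokens with token = "ngrams", n = 2 does this per tweet for us. A sketch with hypothetical stemmed tokens:

```r
# Stemmed tokens from one hypothetical tweet
tokens <- c("vaccin", "caus", "autism")

# Pair each token with the next one to form bigrams
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigrams
# [1] "vaccin caus" "caus autism"
```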
We find the top 10 most common words or unigrams to see common topics in our data. Recall that we have already removed stopwords and unnecessary items such as web links, and we have stemmed the remaining words.
## # A tibble: 20 x 2
## word n
## <chr> <int>
## 1 vaccin 1726
## 2 measl 312
## 3 get 241
## 4 peopl 165
## 5 kid 160
## 6 anti 152
## 7 caus 142
## 8 on 131
## 9 diseas 129
## 10 autism 128
## 11 just 126
## 12 can 124
## 13 year 117
## 14 like 106
## 15 children 97
## 16 flu 97
## 17 immun 89
## 18 know 89
## 19 make 87
## 20 parent 85
We can graph the unigram frequencies with ggplot for a nice visualization of our text data. We pick a frequency cutoff (here, more than 100 appearances) so that only the top few words are plotted.
unigram_count %>%
filter(n>100) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
ylab(NULL) +
coord_flip()
Next, we can look at common bigrams, which are two words in a row. The tidytext function unnest_tokens is an efficient way to find bigrams by setting token = "ngrams" and n = 2. The output column is named “bigram”.
To get the most frequent bigrams we can simply use the dplyr library’s count function again.
bigrams_stemmed <- text_df_stemmed %>%
unnest_tokens(bigram, word, token = "ngrams", n = 2)
head(bigrams_stemmed)
## int bigram
## 1 1 addit abort
## 2 1 abort need
## 3 1 need perform
## 4 1 perform continu
## 5 1 continu creat
## 6 1 creat vaccin
## # A tibble: 10 x 2
## bigram n
## <chr> <int>
## 1 anti vaccin 75
## 2 vaccin caus 69
## 3 measl vaccin 65
## 4 caus autism 60
## 5 get vaccin 58
## 6 flu vaccin 42
## 7 anti vaxxer 32
## 8 measl outbreak 32
## 9 chicken pox 31
## 10 mmr vaccin 28
Last in this section, a fun exploratory graphic for unigrams is a “word cloud.” We can create this visualization easily with the wordcloud package.
We want to classify the tweets by sentiment, or rather by the feelings and emotions expressed by the user. As humans we can easily tell if someone is angry, sad, satisfied, or even sarcastic. However, we have to give the computer some clues on how to do this. We start with the sentiments dataset, which is loaded with the tidytext library. Similar to the stop_words dataset, there are three lexicons:
afinn gives each word an integer score between -5 and 5, from most negative to most positive sentiment.
bing gives a binary “negative” or “positive” sentiment to each word.
nrc assigns sentiment labels of “positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.”
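Because afinn assigns numeric scores, a whole tweet can be scored by averaging over its words, which is the approach we take later with inner_join and summarize. A sketch with a tiny made-up lexicon (the scores here are invented; real ones come from get_sentiments("afinn")):

```r
# Tiny made-up lexicon; scores invented for illustration
lex <- c(death = -2, smart = 2, worse = -3)

tokens <- c("smart", "death", "worse", "vaccine")

# Keep only the tokens present in the lexicon (what inner_join does),
# then average their scores
scores <- lex[tokens[tokens %in% names(lex)]]
mean(scores)
# [1] -1
```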
We can also use the tidytext’s get_sentiments() function to only get data from a particular lexicon.
## # A tibble: 6 x 4
## word sentiment lexicon score
## <chr> <chr> <chr> <int>
## 1 abacus trust nrc NA
## 2 abandon fear nrc NA
## 3 abandon negative nrc NA
## 4 abandon sadness nrc NA
## 5 abandoned anger nrc NA
## 6 abandoned fear nrc NA
## # A tibble: 6 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
“How were these sentiment lexicons put together and validated? They were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data. Given this information, we may hesitate to apply these sentiment lexicons to styles of text dramatically different from what they were validated on, such as narrative fiction from 200 years ago. While it is true that using these sentiment lexicons with, for example, Jane Austen’s novels may give us less accurate results than with tweets sent by a contemporary writer, we still can measure the sentiment content for words that are shared across the lexicon and the text.” (Silge & Robinson, Text Mining with R)
Next let’s merge some of our data with the “afinn” sentiment lexicon to explore our text.
afinn_example <- get_sentiments("afinn") %>%
filter(score == -2)
head(text_df %>%
inner_join(afinn_example) %>%
dplyr::count(word, sort = TRUE),10)
## Joining, by = "word"
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 flu 97
## 2 risk 47
## 3 death 31
## 4 wrong 25
## 5 stupid 22
## 6 injury 21
## 7 misinformation 20
## 8 sick 19
## 9 problem 16
## 10 fear 14
We note that sentiment analysis will not be straightforward. From the text printed above, we can’t pinpoint whether a user is being negative about vaccinations, about the effects of not getting vaccinated, or simply negative toward the anti-vaccination or vaccination communities.
Let’s plot the average afinn scores for each tweet to explore the overall sentiment of our data.
## int word
## 1 1 additional
## 2 1 abortions
## 3 1 need
## 4 1 performed
## 5 1 continue
## 6 1 creating
tweet_sentiment <- text_df %>%
inner_join(get_sentiments("afinn")) %>%
dplyr::group_by(int) %>%
dplyr::mutate(text = paste0(word)) %>%
dplyr::summarize(sentiment_mean = mean(score))
tweet_sentiment_ordered <- tweet_sentiment[order(tweet_sentiment$sentiment_mean),]
tweet_sentiment_ordered$int <- factor(tweet_sentiment_ordered$int, levels = tweet_sentiment_ordered$int)
ggplot(tweet_sentiment_ordered, aes(sentiment_mean)) +
geom_histogram(show.legend = FALSE, binwidth = .3) +
ggtitle("Mean afinn sentiment score per tweet") +
geom_vline(xintercept = 0, linetype = "dashed", colour = "red") +
ylab(NULL)
Remember when we said we should consider the negation issue? As in, how can we distinguish between the sentiment of “not like” and “like”?
Let’s print out some positive tweets. We can use the index column “int” that we’ve kept constant to print out the original tweets (minus links and Twitter handles).
## [1] "Gizmo a -year-old male Rex cat came in despite the weather for his annual exam and vaccines. Great job Gizmo Good health takes priority. #rexcat #bothellpethosp #catsofinstagram #catsoftwitter #meow #vetmed #vettechs #animalspnw "
## [2] "Not a fan of vaccines but I would have let him do it."
## [3] "Mandated vaccines are only good for shareholders. Take that to third world countries where it is needed."
## [4] "Amen. You definitely are Perhaps go back to college and learn. I wish you the best of luck."
## [5] "My mom is translating my grandfather's diary and we just learned that my great-grandfather passed away from tetanus in rural Iran. #VaccinesWork"
## [6] "SOTU on HIV a Great Start...Here's What Really Needs to Happen "
And let’s look at some negative ones as well.
negative <- tweet_sentiment$int[tweet_sentiment$sentiment_mean < -2 ]
head(tweets_df$text[negative])
## [1] "- million children died every year from Measles before the vaccine. "
## [2] "He's talking about killing - percent of people if his vaccine programs do really well. "
## [3] "If you dont believe in #vaccination delete me. Cause Ill tear you apart with fucking science #vaccinate #VaccinateYourKids #science #vaccineswork #youreanidiot"
## [4] "People are sitting in jail for shaken baby convictions when the reality is that most of those cases are vaccine injury being covered up. Pharma continues to take lives even when it isnt directly related to their products. Doesnt that piss you off? "
## [5] "Or how about billion in damages paid out by the vaccine court? Does that count?"
## [6] "Hey What the hell did I miss. Vaccines aren't one size fits all. They are harming and killing babies. It is happening. "
Neither the positive nor the negative tweets are unanimously on one side of this “issue”, even in this small sample of text. Now we can really see how this won’t be easy!
We should think about what we can do with the words that won’t be found in the sentiments dataset, like “#vaccineswork”, which might give us even more information about whether or not the tweet supports vaccines.
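One simple mitigation for the negation issue raised earlier is to flip a word's score when the preceding token is a negator. A sketch with a hypothetical tweet, a made-up one-word lexicon, and an assumed negator list:

```r
lex <- c(like = 2)                   # made-up lexicon
negators <- c("not", "no", "never")  # assumed negator list

tokens <- c("i", "do", "not", "like", "vaccines")  # hypothetical tweet

# Score each token, flipping the sign when the previous token negates it
score_token <- function(i) {
  s <- unname(lex[tokens[i]])
  if (is.na(s)) return(0)  # word not in lexicon
  if (i > 1 && tokens[i - 1] %in% negators) -s else s
}
tweet_score <- sum(sapply(seq_along(tokens), score_token))
tweet_score
# [1] -2
```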
Sentiment may be just a small part of clustering tweets as “pro-vaccination” or “anti-vaccination.”
Since our data is unlabeled, meaning we have no prior knowledge or labels telling us whether a tweet is pro- or anti-vaccine, we use topic modeling to separate the tweets into the two sides using only the text.
Latent Dirichlet allocation (LDA) is a common topic modeling method: it views each document (or in our case, each tweet) as a mixture of topics, and each topic as a combination of words. A word can be part of multiple topics, thus allowing some overlap.
In order to apply LDA, we need to take our cleaned and “tidy” text data text_df and turn it into a document term matrix (DTM). It is easy to convert from the tidy text format to this matrix. First, we add a column counting the number of word repeats per tweet.
## # A tibble: 6 x 3
## # Groups: int [1]
## int word n
## <dbl> <chr> <int>
## 1 1 abort 1
## 2 1 addit 1
## 3 1 continu 1
## 4 1 creat 1
## 5 1 need 1
## 6 1 vaccin 1
Next, we simply use the command tidytext::cast_dtm to convert to a document term matrix, a structure where each row is a document and each column is a word.
## <<DocumentTermMatrix (documents: 1496, terms: 2066)>>
## Non-/sparse entries: 19261/3071475
## Sparsity : 99%
## Maximal term length: 23
## Weighting : term frequency (tf)
## [1] "abort" "addit" "continu" "creat" "need" "vaccin"
## [1] "1" "2" "3" "4" "5" "7"
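Conceptually, a document term matrix is just a cross-tabulation of documents against words; here is a base-R sketch with tiny made-up counts (cast_dtm builds the sparse equivalent from our tidy counts):

```r
# One row per (tweet, word) occurrence, made up for illustration
counts <- data.frame(
  int  = c(1, 1, 2, 2, 2),
  word = c("vaccin", "abort", "vaccin", "cell", "cell")
)

# Rows are documents, columns are words, cells are counts
dtm <- table(counts$int, counts$word)
dtm
```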
Now we are ready to apply LDA to our DTM. We set k=2 because we are trying to create two categories of tweets: pro and anti vaccination. However, we should consider trying 3 categories later, the third being “other” tweets that are irrelevant to the debate. Note that we have no control over whether these will be the two or three clear categories, but it is the goal we are trying to achieve.
We also set an arbitrary seed to ensure we recreate the same results each time we run the code.
## A LDA_VEM topic model with 2 topics.
Now that we have our model results, we must do some exploring in order to interpret them.
As mentioned previously, each word can be in multiple topics. To examine the probability that a certain word belongs to one of the topics, we can retrieve the “beta” values. This can be done easily by providing the function tidy our topic model.
## # A tibble: 6 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 abort 0.00103
## 2 2 abort 0.0000247
## 3 1 addit 0.0000712
## 4 2 addit 0.000503
## 5 1 continu 0.000779
## 6 2 continu 0.000368
Notice this dataframe is again in the “tidy” format with one word/topic per row.
We can plot the most probable words for each topic by first creating a dataframe with the top 10 words for each topic.
Then we sort by the top terms and plot the data with ggplot2, coloring by topic to show the differences.
top_terms <- topics_k2 %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, beta)
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = TRUE) +
coord_flip()
Some words overlap between the topics, but this is okay (and even good) because we know words such as “vaccine” can be used by both sides of this debate.
To look at the most polarizing words, i.e. the words with the greatest difference between the beta values for topics 1 and 2, we can take the log ratio. We filter the topics_k2 tidy dataframe to keep only reasonably common words by requiring topic_1 > .001 or topic_2 > .001.
beta_spread <- topics_k2 %>%
mutate(topic = paste0("topic_", topic)) %>%
spread(topic, beta) %>%
filter(topic_1 > .001 | topic_2 > .001) %>%
mutate(log_ratio = log2(topic_1 / topic_2))
head(beta_spread, 10)
## # A tibble: 10 x 4
## term topic_1 topic_2 log_ratio
## <chr> <dbl> <dbl> <dbl>
## 1 "" 0.000573 0.00220 -1.94
## 2 abort 0.00103 0.0000247 5.38
## 3 actual 0.00103 0.00328 -1.67
## 4 advers 0.000380 0.00115 -1.60
## 5 advic 0.00110 0.0000469 4.55
## 6 ag 0.000560 0.00164 -1.55
## 7 also 0.00296 0.00239 0.310
## 8 anti 0.000283 0.0143 -5.66
## 9 antivax 0.00446 0.00165 1.43
## 10 antivaxx 0.00136 0.000548 1.31
Next we filter the log ratio to display just a few common terms with the largest differences between beta values. Note that since our log ratio is \(\log_2(\frac{topic_1}{topic_2})\), if the beta value for topic 1 is larger, the log ratio will be positive; this means the particular word is more likely to fall in topic 1 than in topic 2. The reverse is also true: if \(\beta_{t2} > \beta_{t1}\), the log ratio is negative.
beta_spread %>%
filter(log_ratio > 4.5 | log_ratio< -4.5) %>%
mutate(term = reorder(term, log_ratio)) %>%
ggplot(aes(term, log_ratio)) +
geom_col(show.legend = FALSE) +
coord_flip()
Examine group probabilities by tweet
Now that we have examined the beta values, i.e. the probability that a word belongs to each topic, we turn to the gamma \((\gamma)\) values, the “per-document-per-topic probabilities.”
We create a new dataframe using tidy() again and observe how it is arranged: for each topic and document combination (for this dataset, each document is a tweet), there is a gamma value between 0 and 1.
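A sketch of this step, assuming the fitted model is lda_k2 (lda_tweets is the name the later kable code expects):

```r
# Per-document-per-topic probabilities: one row per
# (tweet, topic) pair, with gamma between 0 and 1
lda_tweets <- tidy(lda_k2, matrix = "gamma")
head(lda_tweets)
```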
Next, we look at the tweets with the most extreme \(\gamma\) values to get a sense of what we are working with.
## # A tibble: 6 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1 1 0.509
## 2 2 1 0.512
## 3 3 1 0.491
## 4 4 1 0.519
## 5 5 1 0.523
## 6 7 1 0.501
## # A tibble: 6 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 976 2 0.554
## 2 923 1 0.549
## 3 899 2 0.543
## 4 218 1 0.543
## 5 754 1 0.540
## 6 1046 1 0.539
This doesn’t tell us that much until we actually read the tweets in relation to their assigned topics.
Let’s print the 20 tweets with the largest \(\gamma\) scores, meaning they are the most skewed towards one topic or the other. We do this using the kableExtra library to make the Twitter text nice and readable. We can also add a scroll_box to make a window that can be scrolled through when this file is knit to an HTML file.
diff_tweets<-tweets_df$text[as.numeric(lda_tweets$document[1:20])]
diff_tweets<-data.frame(diff_tweets)
diff_tweets$topic <- as.numeric(lda_tweets$topic[1:20])
kable(diff_tweets) %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "750px", height = "300px")
| diff_tweets | topic |
|---|---|
| This paper describes a case study in which an engineered man-made oncolytic virus built on a measles vaccine vector was targeted to multiple myeloma cells and promoted tumor regression in a single patient. Engineered oncolytic viral therapy wild type measles virus. | 2 |
| Im and I received both shingles vaccines in an effort not to get them. After all these years Im still getting vaccines. Smallpox is eradicated because of it. Polio for the most part. And they want to encourage no vaccines? This woman is cray cray. | 1 |
| Thank you for that information. Out of M plus ppl weve had deaths. Yet more ppl were killed by guns this month and its the th. If there is a risk there should be a choice w/vaccines. Ppl should have access to all info then decide. They dont. Only Big Pharma info. | 2 |
| While Im glad you want to learn more you need to be careful about your sources. Vaxxed the movie is grossly inaccurate - the director Andrew Wakefield is the now-delisted doctor whose unethical work started the antivax movement. This may help | 1 |
| Because Polio. You understand the reason no one gets polio anymore is because of a vaccine right? You understand MMR right? Measles Mumps and Rubella. The reason why no one was getting them is because vaccines people stopped getting vaccines and now these sicknesses are back. | 1 |
| Anti Vaxxers are Looney Tunes Andrew Wakefield is a fraud and a conman. The real reason this dr was against proven vaccines is bc he had a company that had a competing vaccine. He is no longer a Dr. | 1 |
| I think vaccines prevent serious illness and I think youre out of your mind if you dont vaccinate your kids. Lets say they may cause autism as you claim which they dont as any sane person knows. There are many wonderful productive people on the spectrum. ALIVE | 2 |
| Why would anyone take vaccination advice from me?I’m not saying take MY advice.I’m saying take the advice OF THE FUCKING OVERWHELMING BODY OF SCIENTIFIC RESEARCH EVIDENCE AND HISTORICAL BEATDOWN OF DISEASES LIKE MEASLES THANKS TO VACCINES.Or stick to your feelings. Sure. | 1 |
| The vaccine schedule has increased exponentially and begins on the day of birth. Infants are getting shots up every months with multiple vaccines for years. | 1 |
| Social media bots and Russian trolls have been spreading disinformation about vaccines on Twitter to create social discord and distribute malware US researchers say. #nhpolitics | 1 |
| My comment clearly stated before the measles vaccine - developed in the ’s - Shine’s comment clearly referenced Baby Boomers - people born from WW to the mid ’s. My comment was refuting the assertion that Baby Boomers were not harmed by Measles. Try reading next time. | 2 |
| Simple. LOOK IN TO what the vaccines and immunizations contain BEFORE you submit your child or self to them. Question get real answers - NOT from the Doctors who have to peddle them NOT from your pharmacist who also has to peddle certain ones - from the sources directly only. | 1 |
| Methodist Hospital researchers find flesh-eating bacteria genetic roadmap hope it leads to vaccine | 1 |
| Good on for going a step further and debunking the false claim that childhood diseases prevent cancer. She writes of the study incorrectly referenced The irony? The virus that was given as part of the therapy was structured similarly to a measles vaccine. | 2 |
| I think you don’t understand how vaccines work. Detecting virus in vaccinated is how you become immune generally without symptoms. The rate of post-vaccine symptoms is WAY lower than the rate of child mortality in those infected by traditional means. | 2 |
| As a man on the spectrum I consider vaccines cause autism to be hate speech not just science denial.These folks would rather their kids be blind and brain-damaged than autistic. In fact they’d rather other people’s kids be blind and brain-damaged than theirs be autistic. | 2 |
| There are far more reasons to be #antivax that to be provax. Fact is there has been little real research into vax-related side effects. Virtually all vaccines enter use with little to no actual testing…just like Roundup entered commercial use with very little testing. | 1 |
| Im saying parents need to be aware of the effects of certain vaccines. Some vaccines dont require you to take them until later in life. Im at work so I cant grab that info now but I think the tetanus shot can be administered at around years old. / | 1 |
| Vaccine choice doesn’t mean no one gets vaccinated.WA has had vaccine choice for at least years and only about aren’t vaccinate…Dr. Bark states that about of the population is genetically at risk for serious complications from MMR.My stats are solid. | 1 |
| There are a lot of dumb people on the internet but anti-vaxxers have got to be the dumbest.No vaccines do not cause autism.No vaccine preventable illnesses are not trivial.Quit demonizing autistic people to push your snake oil. And quit causing kids to die. | 2 |
It’s not obvious that there is a distinct difference between the two topic categories. In fact, topic 2 contains some tweets that are clearly pro-vaccination alongside others that are clearly anti-vaccination. This may be because this kind of topic modeling only looks at individual words such as “vaccine,” and not at bigrams or trigrams such as “vaccines work” and “vaccines cause autism.”
So let’s run a similar analysis but instead of starting with the one-token-per-row format, we start with the bigrams dataframe we created earlier. Take a look at the dataframe we just used for our analysis compared to bigrams:
## int word
## 1 1 addit
## 2 1 abort
## 3 1 need
## 4 1 continu
## 5 1 creat
## 6 1 vaccin
## int bigram
## 1 1 addit abort
## 2 1 abort need
## 3 1 need perform
## 4 1 perform continu
## 5 1 continu creat
## 6 1 creat vaccin
The format is quite similar. Let’s proceed in converting bigrams to a document term matrix (DTM).
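Mirroring the unigram workflow, a sketch of this conversion (bigrams is the dataframe shown above; bigram_counts and bigram_dtm are names introduced here):

```r
library(dplyr)
library(tidytext)

# Count bigram occurrences per tweet, then cast to a DTM
bigram_counts <- bigrams %>%
  group_by(int) %>%
  count(bigram)

bigram_dtm <- bigram_counts %>%
  cast_dtm(document = int, term = bigram, value = n)
```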
## # A tibble: 6 x 3
## # Groups: int [1]
## int bigram n
## <dbl> <chr> <int>
## 1 1 abort need 1
## 2 1 addit abort 1
## 3 1 continu creat 1
## 4 1 creat vaccin 1
## 5 1 need perform 1
## 6 1 perform continu 1
## <<DocumentTermMatrix (documents: 1490, terms: 17508)>>
## Non-/sparse entries: 21515/26065405
## Sparsity : 100%
## Maximal term length: 188
## Weighting : term frequency (tf)
## [1] "abort need" "addit abort" "continu creat" "creat vaccin"
## [5] "need perform" "perform continu"
Now we apply LDA to this DTM as before.
This time, let’s try using k = 3 to see if we can capture three topics: pro-vaccination, anti-vaccination, and “other” tweets that are irrelevant to the debate.
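The fit might look like the following sketch (lda_bi_k3 is the model name the later code uses; the DTM and seed names are assumptions):

```r
library(topicmodels)

# Fit a three-topic LDA model on the bigram DTM (seed is arbitrary)
lda_bi_k3 <- LDA(bigram_dtm, k = 3, control = list(seed = 1234))
lda_bi_k3
```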
## A LDA_VEM topic model with 3 topics.
We get the beta values as before.
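As before, a brief sketch (bigram_beta is a name introduced here):

```r
# One row per bigram/topic pair, with beta = P(bigram | topic)
bigram_beta <- tidy(lda_bi_k3, matrix = "beta")
head(bigram_beta)
```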
## # A tibble: 6 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 abort need 2.74e-177
## 2 2 abort need 1.33e-177
## 3 3 abort need 1.39e- 4
## 4 1 addit abort 2.64e-177
## 5 2 addit abort 1.69e-179
## 6 3 addit abort 1.39e- 4
bigram_gamma <- tidy(lda_bi_k3, matrix = "gamma")
bigram_gamma<-bigram_gamma %>% arrange(desc(gamma))
head(bigram_gamma)
## # A tibble: 6 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 121 3 0.999
## 2 102 1 0.999
## 3 4733 2 0.999
## 4 4880 3 0.999
## 5 494 2 0.999
## 6 4918 2 0.998
Wow! These \(\gamma\) values are much higher than when we used unigrams and 2 topics. Let’s read the tweets to see how it did.
diff_tweets<-tweets_df$text[as.numeric(bigram_gamma$document[1:20])]
diff_tweets<-data.frame(diff_tweets)
diff_tweets$topic <- as.numeric(bigram_gamma$topic[1:20])
kable(diff_tweets) %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "750px", height = "300px")
| diff_tweets | topic |
|---|---|
| I just got the HPV vaccine because ya boy doesnt need dick cancer and the doctor was like you might get light headed afterwords and Im like yeah okay whatever.. homie stuck that needle in my arm and I hit that exam chair blacked tf out like a rock for like seconds | 3 |
| ur a trollOr u dont know why u hold these beliefs but someone told you that vaccines cause autism it played into ur unhealthy sense of paranoia when confronted w/ facts instead of realizing u have no idea what ur talking abt u lash out ppl as a defense mechanism | 1 |
| Lets use the HPV vaccine-what happened when group got HPV vaccine group got the adjuvant amorphous aluminum hydroxyphosphate sulfate group got saline?DEATHS in the HPV AAHS groupsNo deaths in the saline group#AskWhyPharma doesnt want saline placebo studies | 2 |
| anti-vaxxers are kinda confusing to me bc. its been scientifically proven that vaccines don’t even cause autism. even if they did would u really wanna risk the death of ur kid just bc u dont want them to have a mental disabilityi mean the choice seems pretty obvious to me | 3 |
| Dumb and WrongDarla Shine Wife of Top Trump Official Bill Shine Goes on Pro-Measles Anti-Vax Rant Bring back our #ChildhoodDiseases the former Fox News producer wrote they keep you healthy fight cancer. Surprise shes dangerously wrong. | 2 |
| Baby boomers also gave us the climate crisis expensive for-profit healthcare the Great Recession stupid anti-vaccine movements wealth disparity and overall one big mess of a world millennials have to fight through each day to fix and improve. #VoteBlue #ScienceisLife | 2 |
| Nobody is saying no one gets serious problems from vaccines but that number of people is so minuscule compared to the risk of I dunno bringing back a disease we virtually eradicated decades ago that we can handle one case out of one million having a semi-serious side effect | 3 |
| This paper describes a case study in which an engineered man-made oncolytic virus built on a measles vaccine vector was targeted to multiple myeloma cells and promoted tumor regression in a single patient. Engineered oncolytic viral therapy wild type measles virus. | 1 |
| Now you’re acutely aware of the dangers. Maybe now you harbour thoughts in the back of your mind about the possibility. You can even meta-analyse patient data. I would look for increased dementia post flu vaccine and sudden onset of Lupus or similar autoimmunity in post Gardisil | 1 |
| Just a another Trumper that has no medical knowledge so many diseases are preventable with the proper vaccine sequences. Thinking its a good thing to let children get measles chicken pox mumps etc. is sick and totally stupid. Stop pedaling you bullshit without knowing. | 1 |
| Uhhh what? That’s just not how science works. My year old had her latest round of shots Monday. Was she sad? Yes. Will she grow up to avoid preventable diseases? Also yes. Will she be told vaccines work and people who say they’re dangerous/unnecessary are WRONG? SO MUCH YES. | 1 |
| Facebook mom I dont believe autism isnt caused by vaccines. The studies by actual scientists arent really accurateAlso Facebook mom I believe I can make a month selling this shitty makeup because a woman named Susan Ive never met told me I could | 2 |
| Measles MMR vaccine protects against measles/mumps/rubella.Measles complications incl blindness encephalitis severe diarrhea/ dehydration ear infection resp infections like pneumonia.Measles harmful to unborn baby in womb children can die years later of complication. | 2 |
| There’s still time to get your #flu shot Getting a #flu vaccine isnt just about keeping you healthyit also helps protect the people around you from flu. Get your flu shot now and help to protect your family friends community for the rest of flu season. | 2 |
| A teenager who Clearly Does Not Belong Here approaches a burnt-out streetlight. An older man meets him.Teen Y-you got the stuff?Man You got the money?Teen flashes cashMan Then I got whatever you need. opens coat to reveal vaccines Flu Measles Whooping Cough… | 3 |
| His father was head of Planned Parenthood? Wow. Then he goes on TED and talks about lowering population thru better health care abortion and vaccines. When we wake up years from now and see the fertility issues caused by things like the HPV vaccine it will be a little late. | 3 |
| Autistic disorder change point years are coincident with intro of vax’s manu’d using human fetal cell lines containing fetal and retroviral contaminants into childhood vaccine regimens. Thus rising autistic disorder prevalence is directly related to vaccines manufactured | 3 |
| Lady at the health department told me my sons shot record for vaccines was beautiful. My kid just turned months old hes crawling months early and sitting up on his own months early already trying to stand months early. But vAcCiNeS cAuSe AuTiSm | 2 |
| No way can you ever fix stupid not even with duct tape Vaccines saves lives If the USA didnt have these vaccines in place for the masses the United States would be a third world country Wake up America and make sure youre vaccinated and your children are vaccinated | 3 |
| My son is months old and it’s another months before I can put him in a day care without worrying about him contracting Measles. My state requires a small class on the benefits for the vaccine to get a waiver for school/day care. I love being a STAHM but dammit I wanna work | 3 |
Next, we will compare the results from the topic modeling to the social network approach, checking whether the same tweets were clustered into the same communities. We will then map the results to see if there is a geographical trend.
Social network analysis
Assuming that, with high probability, users do not change their views (especially within a 7-day window), we can label Twitter users and represent them as a social network to help classify tweets.
To do this, we will use the power of retweets. We use retweets because we can assume that if a user retweets another user’s tweet about vaccination, they probably agree with that opinion. This assumption gets stronger as the social network grows more connections.
We will build a network using the igraph package. This network will have users as nodes (vertices), with an edge between two vertices if the users retweet the same tweet, are following or friends with each other, or one retweeted a tweet directly from the other. The weight of the edge increases if more than one of these conditions is true.
First, we should create a data frame where the first two columns are vertices with an edge between them, adding edge attributes where multiple kinds of connections apply.
Now we have each tweet on the list with the associated users. Let’s create an edge between all these users by building the dataframe that will be converted to a graph object with igraph, and then remove the intermediate objects we no longer need.
So now we’ve connected each user to all of its tweets, and each unique tweet to all unique users that retweeted it.
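As a minimal sketch, an edge dataframe like the one described above can be turned into an igraph object as follows. The edges_df values here are hypothetical stand-ins, not the actual Twitter data:

```r
library(igraph)

# Hypothetical edge list: each row links a user to a tweet id,
# with a weight counting how many connection types apply
edges_df <- data.frame(
  from   = c("userA", "userB", "userB", "userC"),
  to     = c("tweet1", "tweet1", "tweet2", "tweet2"),
  weight = c(1, 1, 2, 1)
)

# Build an undirected, weighted graph from the edge list
g <- graph_from_data_frame(edges_df, directed = FALSE)
```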
We set the labels of all remaining vertices to -1 so that the label propagation algorithm treats them as unlabeled vertices.
Next, we run label propagation and add vertex attributes.
For data visualization, we color the largest communities.
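A sketch of the semi-supervised label propagation step, assuming the graph is g and a vertex attribute label holds the seed labels, with -1 marking unlabeled vertices:

```r
library(igraph)

# cluster_label_prop accepts initial labels and a logical vector of
# labels to hold fixed; negative initial values mean "unlabeled"
seed_labels <- V(g)$label
communities <- cluster_label_prop(
  g,
  initial = seed_labels,
  fixed   = seed_labels != -1
)

# Store the resulting community membership back on the vertices
V(g)$community <- membership(communities)
```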
Now that we have created all the labels, we can display the graph with graphjs.
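For example (assuming the vertices of g were given a color attribute in the coloring step above; the vertex.size value is an arbitrary choice):

```r
library(threejs)

# Interactive 3-D rendering of the retweet network, with vertex
# colors taken from the community coloring assigned above
graphjs(g, vertex.color = V(g)$color, vertex.size = 0.5)
```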